Crime Rate Prediction in San Francisco City (2018)

This assignment is about predicting the crime rate in San Francisco City. The dataset is available at https://data.sfgov.org/Public-Safety/Police-Department-Incident-Reports-2018-to-Present/wg3w-h783

Please note that the data only covers incidents reported from 2018 onwards, because the source system was migrated.

A copy of this .csv is available under the datasets folder. Since the dataset is updated daily, a time-stamped copy retrieved on 20210513 is used for this assignment.

The target variable is the predicted type of crime - this is a multiclass discrete variable

0. Problem Statement : Given a location and a time, are we able to predict the type of crime that is going to happen?

( Let's find out! )

Dataset - Its Fields and Description

No. Field Remark Field Type Further Remarks
1 Incident Datetime The date and time when the incident occurred Datetime
2 Incident Date The date the incident occurred Date
3 Incident Time The time the incident occurred String
4 Incident Year The year the incident occurred, provided as a convenience for filtering String
5 Incident Day of Week The day of week the incident occurred String
6 Report Datetime Distinct from Incident Datetime, Report Datetime is when the report was filed. Datetime
7 Row ID A unique identifier for each row of data in the dataset Numeric
8 Incident ID This is the system generated identifier for incident reports. Incident IDs and Incident Numbers both uniquely identify reports, but Incident Numbers are used when referencing cases and report documents. Numeric
9 Incident Number The number issued on the report, sometimes interchangeably referred to as the Case Number. This number is used to reference cases and report documents. Numeric
10 CAD Number The Computer Aided Dispatch (CAD) is the system used by the Department of Emergency Management (DEM) to dispatch officers and other public safety personnel. Numeric
11 Report Type Code A system code for report types, these have corresponding descriptions within the dataset. String Likely to be related to Report Type Description
12 Report Type Description The description of the report type, can be one of: Initial; Initial Supplement; Vehicle Initial; Vehicle Supplement; Coplogic Initial; Coplogic Supplement String
13 Filed Online If non-emergency cases are filed through Coplogic (a self-service platform), this will be flagged as TRUE String TRUE or blank
14 Incident Code Incident Codes are the system codes to describe a type of incident. A single incident report can have one or more incident types associated. In those cases you will see multiple rows representing a unique combination of the Incident ID and Incident Code. Numeric
15 Incident Category A category mapped on to the Incident Code used in statistics and reporting. Mappings provided by the Crime Analysis Unit of the Police Department. String This is the target variable
16 Incident Subcategory A subcategory mapped to the Incident Code that is used for statistics and reporting. Mappings are provided by the Crime Analysis Unit of the Police Department. String
17 Incident Description The description of the incident that corresponds with the Incident Code. String
18 Resolution The resolution of the incident at the time of the report. Note: once a report is filed, the Resolution will not change. String One of: Cite or Arrest Adult; Cite or Arrest Juvenile; Exceptional Adult; Exceptional Juvenile; Open or Active; Unfounded
19 Intersection The 2 or more street names that intersect closest to the original incident separated by a backward slash (\). String
20 CNN The unique identifier of the intersection for reference back to other related basemap datasets. Numeric
21 Police District The Police District where the incident occurred.  String
22 Analysis Neighborhood This field is used to identify the neighborhood where each incident occurs. Related to intersection field; a district has multiple neighbourhoods String
23 Supervisor District The legislative body no. that is assigned to the district - districts are numbered 1 through 11 - Not a 1-1 mapping to police district Numeric
24 Latitude The latitude coordinate in WGS84, spatial reference is EPSG:4326; EPSG:4326 is a geographic, non-projected coordinate system. It is used in lat/long GPS displays Numeric
25 Longitude The longitude coordinate in WGS84, spatial reference is EPSG:4326; EPSG:4326 is a geographic, non-projected coordinate system. It is used in lat/long GPS displays Numeric
26 shape Geolocation in OGC WKT format (e.g., POINT(37.4,-122.3)) - i.e. (longitude, latitude) shape object
27 Neighborhoods Not clear - not explained on the website Numeric

1. Drop obviously unnecessary columns

The following columns are deemed unnecessary for this report and will not be imported for this analysis

Field Reason for Not Being Imported
CAD Number This number comes from the emergency dispatch system used by the SF Police. Not important for predicting the type of crime
Filed Online Whether the report was filed online is not important for crime rate prediction
Intersection This is used to mark the intersection on a map - not used. Latitude and longitude are more useful
CNN Related to Intersection in the mapping dataset; not particularly useful in prediction unless we are certain the intersection influences the crime rate
Supervisor District Not important in this analysis since offenders do not generally care about the legislative body of the district when committing a crime
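As a minimal sketch of skipping these columns at import time, a callable can be passed to `usecols` in `pandas.read_csv`. The in-memory CSV and its single row are hypothetical stand-ins for the real file, and the raw column names are assumed to match the field names in the table above.

```python
import io
import pandas as pd

# Columns the text above decides not to import.
DROPPED = {"CAD Number", "Filed Online", "Intersection", "CNN", "Supervisor District"}

# Tiny stand-in for the real CSV (hypothetical row, for illustration only).
sample_csv = io.StringIO(
    "Incident Datetime,CAD Number,Filed Online,Intersection,CNN,"
    "Supervisor District,Police District,Latitude,Longitude\n"
    "2018/01/01 12:00:00 PM,180010001,,A ST \\ B ST,2000,3,Central,37.79,-122.40\n"
)

# A callable passed to usecols keeps only the columns for which it returns True.
df = pd.read_csv(sample_csv, usecols=lambda name: name not in DROPPED)
print(list(df.columns))
```

Filtering at read time avoids loading columns that would be dropped straight away, which helps with a file of several hundred thousand rows.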

For now, the original number of records = 457041

2. Let's do some cleaning now

2.1 Cleaning up 'Out of SF' records

Refer to this https://support.datasf.org/help/police-department-incident-reports-2018-to-present-overview

Addresses for incidents outside of SF - some cases are referred from outside SFPD districts. These will be marked as “Out of SF” in the Police District column and do not have associated geographic information.

We can remove these records from the df

2.1.1 Do a count of how many records are under 'Out of SF'

There's no point in keeping 'Out of SF' records with NaN latitude and NaN longitude. At the same time, we can see that there ARE some records that are 'Out of SF' with non-empty (lat,long), so the documentation is not exactly correct in this aspect.

Here we have 8261 'Out of SF' records with NaN (lat,long) and will drop these from the df first

This takes us to 457041 - 8261 = 448780 records for now

We then have 4530 records left that are marked 'Out of SF' only

After cleaning these up, we are now left with 444250 records
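The two-step removal described above can be sketched as follows. The miniature DataFrame and the column names (`Police_District`, `Latitude`, `Longitude`) are assumptions for illustration; the notebook's actual names may differ.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame; column names are assumed.
df = pd.DataFrame({
    "Police_District": ["Central", "Out of SF", "Out of SF", "Mission"],
    "Latitude": [37.79, np.nan, 37.70, 37.76],
    "Longitude": [-122.40, np.nan, -122.45, -122.42],
})

out_of_sf = df["Police_District"] == "Out of SF"
no_coords = df["Latitude"].isna() & df["Longitude"].isna()

# Step 1: drop 'Out of SF' rows that have no coordinates.
step1 = df[~(out_of_sf & no_coords)]

# Step 2: drop the remaining 'Out of SF' rows (those that do have coordinates).
cleaned = step1[step1["Police_District"] != "Out of SF"].reset_index(drop=True)

print(len(df), len(step1), len(cleaned))
```

Keeping the two steps separate makes it easy to report the intermediate record counts, as the text does.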

2.2 Cleaning up 'NaN Lat Long' records

Let's take a look at the records now to see which columns have missing values

So far there are 14990 records with missing (lat,long)

Breakdown of Empty (lat,long) by districts

Breakdown of Empty (lat,long) records by Year
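A minimal sketch of how the missing-value counts and the two breakdowns could be produced with pandas. The toy data and the column names `Police_District` and `Incident_Year` are assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column names are assumed.
df = pd.DataFrame({
    "Police_District": ["Central", "Central", "Mission", "Taraval"],
    "Incident_Year": [2018, 2019, 2019, 2020],
    "Latitude": [np.nan, 37.79, np.nan, np.nan],
    "Longitude": [np.nan, -122.40, np.nan, np.nan],
})

# Number of missing values in every column.
missing_per_column = df.isna().sum()

# Rows with an empty latitude or longitude, broken down two ways.
missing_coords = df[df["Latitude"].isna() | df["Longitude"].isna()]
by_district = missing_coords.groupby("Police_District").size()
by_year = missing_coords.groupby("Incident_Year").size()

print(missing_per_column, by_district, by_year, sep="\n\n")
```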

2.3 We shall take a look at Analysis_Neighborhood missing values

It seems all the records with a missing Analysis_Neighborhood belong to 'Ingleside' only, so we temporarily fill these records with 'Ingleside' (fillna)
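A small sketch of this check-then-fill step, assuming columns named `Police_District` and `Analysis_Neighborhood`. The rows are made up for illustration.

```python
import pandas as pd

# Hypothetical rows; the real data has many more districts and neighborhoods.
df = pd.DataFrame({
    "Police_District": ["Ingleside", "Ingleside", "Central"],
    "Analysis_Neighborhood": [None, "Excelsior", "Chinatown"],
})

# Which police districts do the rows with a missing neighborhood belong to?
missing = df["Analysis_Neighborhood"].isna()
districts = df.loc[missing, "Police_District"].unique()

# Temporary fill, as described in the text above.
df["Analysis_Neighborhood"] = df["Analysis_Neighborhood"].fillna("Ingleside")
```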

2.4 Let's take a look again at Incident Category & Incident_Subcategory

So it appears that Incident_Category and Incident_Subcategory are null for the same set of records

We check whether a similar Incident_Category has been used before for the same Incident_Description.

If there is one, we can choose to use it as a replacement

There appears to be no good match. Here's a proposed mapping from a best-effort approach after examining the descriptions

Missing Incident_Category but having Incident_Desc!=null | Map to Incident_Category | Map to Incident_Subcategory
Assault, By Police Officers | Assault | Assault
Assault, Commission of While Armed | Assault | Assault
Auto Impounded | Vehicle Impounded | Vehicle Impounded
Cloned Cellular Phone, Use | Other Offenses | Other Offenses
Crimes Involving Receipts or Titles | Fraud | Fraud
Driving, Stunt Vehicle/Street Racing | Traffic Violation Arrest | Traffic Violation Arrest
Gun Violence Restraining Order | Other Offenses | Other Offenses
Gun Violence Restraining Order Violation | Other Offenses | Other Offenses
Military Ordinance | Other Offenses | Other Offenses
Procurement, Pimping, & Pandering | Human Trafficking, Commercial Sex Acts | Human Trafficking, Commercial Sex Acts
Public Health Order Violation, After Notification | Other Offenses | Other Offenses
Public Health Order Violation, Notification | Other Offenses | Other Offenses
SFMTA Muni Transit Operator-Bus/LRV | Other Offenses | Other Offenses
SFMTA Parking and Control Officer | Other Offenses | Other Offenses
Service of Documents Related to a Civil Drug Abatement and/or Public Nuisance Action | Other Offenses | Other Offenses
Theft, Animal, Att. | Larceny Theft | Larceny Theft - Other
Theft, Boat | Larceny Theft | Larceny Theft - Other
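One way to apply such a mapping is to fill only the null cells from a lookup keyed on the description. This is a sketch using two rows of the table above; the column names and the already-labelled "Battery" row are assumptions.

```python
import pandas as pd

# Two rows of the proposed mapping: description -> (category, subcategory).
PROPOSED = {
    "Theft, Boat": ("Larceny Theft", "Larceny Theft - Other"),
    "Auto Impounded": ("Vehicle Impounded", "Vehicle Impounded"),
}

# Hypothetical records; the last row already has a category and must stay untouched.
df = pd.DataFrame({
    "Incident_Description": ["Theft, Boat", "Auto Impounded", "Battery"],
    "Incident_Category": [None, None, "Assault"],
    "Incident_Subcategory": [None, None, "Simple Assault"],
})

cat_map = {desc: cat for desc, (cat, _) in PROPOSED.items()}
sub_map = {desc: sub for desc, (_, sub) in PROPOSED.items()}

# fillna with a Series aligns on the index, so only null cells are filled.
df["Incident_Category"] = df["Incident_Category"].fillna(
    df["Incident_Description"].map(cat_map)
)
df["Incident_Subcategory"] = df["Incident_Subcategory"].fillna(
    df["Incident_Description"].map(sub_map)
)
```

Using `fillna` rather than plain assignment means existing categories are never overwritten, even if a description happens to appear in the mapping.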

Incident Category & Incident_Subcategory - END

2.5 So what do we do with 'Neighborhoods' column ?

Plus Incident_No, Incident No, Incident_code, ReportType_Code, ReportType_Desc, RowID

As examined, there is no clear documentation on the web on what 'Neighborhoods' means - we do not know if it is the neighborhood number or the number of neighborhoods that the location is near to

The decision is to drop this column altogether

In addition, these columns are not needed for analysis and will be dropped as well

2.6 It's time to deal with Category Correction and Binning

A few observations:
1. 'Weapons Offence' should be spelled 'Weapons Offense'
2. There are spelling mistakes (?) such as 'Motor Vehicle Theft?'
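A sketch of the correction step with `Series.replace`. Mapping 'Motor Vehicle Theft?' onto 'Motor Vehicle Theft' is an assumption about the intended fix, not something the text confirms.

```python
import pandas as pd

# Hypothetical sample of category labels containing both variants.
categories = pd.Series([
    "Weapons Offence",
    "Weapons Offense",
    "Motor Vehicle Theft?",
    "Motor Vehicle Theft",
])

# Exact-match replacements; the second entry is an assumed correction.
corrections = {
    "Weapons Offence": "Weapons Offense",
    "Motor Vehicle Theft?": "Motor Vehicle Theft",
}

fixed = categories.replace(corrections)
print(fixed.value_counts())
```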

2.7 It's time to deal with Outliers

Fun thing to observe: this is actually correct, because there is no land at (lat = 37.81, long = -122.39)

Refer to the folium map of SF

2.8 PREPROCESS: ADD DATE_MONTH FOR ANALYSIS PURPOSES

We extract the following fields from the datetime column [Incident_Datetime] for analysis purposes
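A minimal sketch of the extraction, using the derived column names that appear later in the feature list. The raw datetime format and the choice to derive every field from `Incident_Datetime` are assumptions.

```python
import pandas as pd

# Hypothetical raw values; the string format is assumed.
df = pd.DataFrame({
    "Incident_Datetime": ["2019/03/15 06:45:00 PM", "2020/01/01 12:05:00 AM"],
})

dt = pd.to_datetime(df["Incident_Datetime"], format="%Y/%m/%d %I:%M:%S %p")

df["Incident_Date_Month"] = dt.dt.month
df["Incident_Datetime_DayOfMonth"] = dt.dt.day
df["Incident_Datetime_HrOfDay"] = dt.dt.hour
df["Incident_Datetime_Minute"] = dt.dt.minute
# pandas encodes Monday as 0 and Sunday as 6.
df["Incident_Date_dayofweek_num"] = dt.dt.dayofweek
```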

3. Data Exploration Time

Observations: there are 42 classes of Incident_Category. Larceny Theft has the highest count, while some categories are really low in number. This is an imbalanced multiclass scenario, and we keep this in mind from here on
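The class imbalance can be quantified with `value_counts`. A sketch on made-up labels; the real data has 42 categories.

```python
import pandas as pd

# Hypothetical labels with one dominant class.
categories = pd.Series(["Larceny Theft"] * 6 + ["Assault"] * 3 + ["Gambling"])

counts = categories.value_counts()                 # absolute counts, largest first
share = categories.value_counts(normalize=True)    # fraction of all incidents
top = counts.head(2)                               # use head(10) for a top-10 view

print(counts)
print(share.round(2))
```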

3.1 Distribution of incidents across all categories

This visualizes how the number of Larceny Theft incidents compares to the rest of the incidents

3.2 TOP 10 Incident Categories

3.3 Distribution of Incident Across Years

It looks like there is a gradual decrease in crime incidents from 2018

3.4 Distribution of Crimes over Months

Observation : Most crimes seem to occur in Jan, the fewest in Jun

However, we can look at this for each year separately, to see if this pattern holds in general

Observation: There appears to be a descending trend from Jan onwards, followed by an upward trend from Jun/Jul onwards. However, this is not observed in 2020, where there is a massive drop from Jan'20 to Apr'20. One possible reason is the city's lockdown: there could have been fewer people on the streets and fewer policemen on patrol, i.e. less public activity in general

Since 2021 is not over yet, we are unable to see any trend. However, it is noticeable that even Jan'21 has only 8473 reported incidents, fewer than January in the other years (2018, 2019, 2020)
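A month-by-year table makes this comparison straightforward. A sketch with `pandas.crosstab` on made-up rows, assuming the columns `Incident_Year` and `Incident_Date_Month`.

```python
import pandas as pd

# Hypothetical incidents; each row is one incident.
df = pd.DataFrame({
    "Incident_Year": [2018, 2018, 2019, 2020, 2020, 2020],
    "Incident_Date_Month": [1, 6, 1, 1, 1, 4],
})

# Rows are months, columns are years, cells are incident counts.
table = pd.crosstab(df["Incident_Date_Month"], df["Incident_Year"])
print(table)
```

Plotting one line per column of `table` would show whether the Jan-to-Jun pattern repeats each year.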

3.5 Distribution of Crimes over Days (of Month )

Observation : There is no obvious trend here, except that most incidents seem to occur on the 1st day of the month. To be fair, the count for the 31st is not comparable, as some months do not have a 31st

3.6 Distribution of Crimes over DayofWeek ( Mon...Tue...)

Observations : It seems that most crimes occur on Fri, and the fewest on Sun.

3.7 Distribution of Crimes over Hour of Day

Observations : There are several peaks: (1) 12pm, (2) 1800hrs, (3) after 12 midnight

In general, it also appears that most incidents occur after 12pm

From 0100hrs to 0600hrs, the trend decreases

3.8 Distribution of Crimes over different police districts

Observation: It appears that Central has the highest number, followed by Northern, Mission

Let's take a visual peek at the crimes over a map

NOTE : These are the boundaries of each district

This is a Visual representation of intensity of crimes over the map of SF

Distribution of Crimes over different police districts - END

4. Selecting Features And Encoding Districts

4.1 Basic Correlation with no added new features ( except extraction of date components )

NOTE

Most of these features look ok; they are not really correlated with one another

4.2 Dummy Encoding the districts since they are categorical

4.3 Create a new field CrimeCounts_By_AnalysisNghrBood

This calculates the number of crimes per neighborhood
The rationale behind this is that, since we have the neighborhood info [Analysis_Neighborhood] on hand, we can investigate whether crime_rate is influenced by the count of crimes per neighborhood
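A minimal sketch of building this field by mapping each row's neighborhood onto its overall count. The toy rows are made up; only the column names come from the text.

```python
import pandas as pd

# Hypothetical rows; each row is one incident.
df = pd.DataFrame({
    "Analysis_Neighborhood": ["Mission", "Mission", "Chinatown", "Mission"],
})

# Count incidents per neighborhood, then attach that count to every row.
counts = df["Analysis_Neighborhood"].value_counts()
df["CrimeCounts_By_AnalysisNghrBood"] = df["Analysis_Neighborhood"].map(counts)
```

If the counts are computed on the full dataset before the train/test split, the test rows contribute to the feature; computing `counts` on the training split only is a possible alternative.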

4.4 Examining Correlation Matrix After Adding in new field

NOTE

It is observed that the dummy-encoded districts are not much correlated with the rest of the numeric features. The correlation with (lat,long) is expected, since each district is at a different location. The rest look ok

4.5 EXAMINING FEATURE IMPORTANCE before training

4.5.1 Examining Feature importance through KBest

Observation : None of these features presents a major problem at the moment
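For reference, a sketch of scoring features with scikit-learn's `SelectKBest`. The score function (`f_classif`) and the synthetic data are assumptions; the text does not say which scoring function was used.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic data: one feature that tracks the label, one pure-noise feature.
rng = np.random.default_rng(0)
y = rng.integers(0, 3, size=300)
informative = y + rng.normal(scale=0.1, size=300)
noise = rng.normal(size=300)
X = np.column_stack([informative, noise])

# k="all" keeps every feature and simply exposes the per-feature scores.
selector = SelectKBest(score_func=f_classif, k="all").fit(X, y)
scores = selector.scores_
print(dict(zip(["informative", "noise"], scores.round(2))))
```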

For now, this is a list of features selected for training the models

'Incident_Year'
'Incident_Date_Month'
'Incident_Datetime_DayOfMonth'
'Incident_Datetime_HrOfDay'
'Incident_Datetime_Minute'
'Incident_Date_dayofweek_num'
'Latitude'
'Longitude'
'CrimeCounts_By_AnalysisNghrBood'

Districts are dummy coded
PD_is_Bayview
PD_is_Central
PD_is_Ingleside
PD_is_Mission
PD_is_Northern
PD_is_Park
PD_is_Richmond
PD_is_Southern
PD_is_Taraval
PD_is_Tenderloin
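The dummy-coded district columns listed above could be produced along these lines with `pandas.get_dummies`. The source column name `Police_District` is an assumption.

```python
import pandas as pd

# Hypothetical rows with two of the ten districts.
df = pd.DataFrame({"Police_District": ["Central", "Mission", "Central"]})

# prefix="PD_is" with the default separator "_" yields names like PD_is_Central.
dummies = pd.get_dummies(df["Police_District"], prefix="PD_is", dtype=int)
df = pd.concat([df, dummies], axis=1)
print(df)
```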

5. Playing with models

5.1 COMPARE AGAINST A NUMBER OF CLASSIFIER ALGOS

Now we will run a k-fold cross-validation across a number of classifiers to find out how they perform

The above will take ages to run. These are the accuracies and runtimes from a single run

model | accuracy | time taken
CART | 22.805 | 42.79 seconds
RFC | 33.1364 | 7.03 minutes
NB | 4.6642 | 6.95 seconds
ADA | 28.207 | 5.91 minutes
BaggingClassifier | 29.3815 | 2.00 minutes
XGBOOST | 33.5687 | 35.65 minutes
KNN | 23.2457 | 19.9 minutes
GBC | 30.8148 | 208.02 minutes

After introducing 'CrimeCounts_By_AnalysisNghrBood'

model | accuracy | time taken (minutes)
CART | 22.7837 | 0.43958842
RFC | 33.186 | 7.090382024
NB | 29.3342 | 0.119177258
ADA | 28.6533 | 5.835134919
BaggingClassifier | 29.3745 | 2.031154434
XGBOOST | 33.5907 | 37.11402934
KNN | 23.9512 | 21.97902185
GBC | 31.4847 | 225.5859336
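A comparison like the one tabulated above can be sketched as a loop over classifiers with `cross_val_score`, timing each one. The synthetic data, the number of folds and the two models shown are assumptions; the other classifiers would be added to the `models` dict in the same way.

```python
import time

from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Small synthetic stand-in for the real feature matrix and labels.
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=4, n_classes=3, random_state=0
)

models = {
    "CART": DecisionTreeClassifier(random_state=0),
    "NB": GaussianNB(),
}

kfold = KFold(n_splits=5, shuffle=True, random_state=0)
results = {}
for name, model in models.items():
    start = time.perf_counter()
    scores = cross_val_score(model, X, y, cv=kfold, scoring="accuracy")
    elapsed = time.perf_counter() - start
    # Mean accuracy in percent and wall-clock seconds for all folds.
    results[name] = (scores.mean() * 100, elapsed)

for name, (accuracy, seconds) in results.items():
    print(f"{name} | {accuracy:.4f} | {seconds:.2f} seconds")
```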

Of these, what is not shown is

So it appears that these are the initial estimates of the models that seem to strike a balance between time and accuracy

But we observe that the overall accuracy is not fantastic, topping out at about 33%

Since none of them performs much better than 33%, we take RFC as a reference and look closer into it

The strategy is to form a base model using the selected models, investigate the performance and fine-tune the hyperparameters

5.2 A Base Model for RFC (RandomForestClassifier)

[Optional] : AUC Curve
https://sinyi-chou.github.io/classification-auc/

[Optional] : Relationship of AUC & PR
https://sinyi-chou.github.io/classification-pr-curve/
https://www.datascienceblog.net/post/machine-learning/interpreting-roc-curves-auc/
https://www.geeksforgeeks.org/precision-recall-curve-ml/
pr_roc.png
Reference : https://sinyi-chou.github.io/classification-pr-curve/

The PR curve focuses on the minority class, whereas the ROC curve covers both classes.
Reference: https://machinelearningmastery.com/roc-curves-and-precision-recall-curves-for-imbalanced-classification/

Observation : The accuracy from a base model RFC is at 33.68 and F1 is around 28.51.

And we see that a few classes are ending up with 0.0 precision and 0.0 recall
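A sketch of how such a base model and its metrics could be obtained. The synthetic imbalanced data, the split, the forest size and the weighted F1 averaging are all assumptions; the text does not state which F1 average was reported.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, f1_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced 3-class problem standing in for the incident data.
X, y = make_classification(
    n_samples=600, n_features=8, n_informative=4, n_classes=3,
    weights=[0.8, 0.15, 0.05], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

rfc = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_train, y_train)
pred = rfc.predict(X_test)

acc = accuracy_score(y_test, pred)
f1 = f1_score(y_test, pred, average="weighted")  # averaging choice is assumed
# zero_division=0 reports 0.0 for classes that are never predicted.
report = classification_report(y_test, pred, zero_division=0)
print(f"accuracy={acc:.4f} f1={f1:.4f}")
print(report)
```

The per-class rows of the report are where classes with 0.0 precision and 0.0 recall show up.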

We can check on the ClassBalance of training & test set

Visual of feature importance after RFC

5.2.1 Running Base with class_weight = Balanced

5.2.2 Running Base with class_weight using TRAIN SET

5.2.3 Create weights by smoothed weight method

For this project, I elect to use F1, the Classification Report and the ConfusionMatrix as metrics for easier assessment

(b) Using RFC with class_weight = 'balanced', calc_weight_on_train or smoothen_weight_formula does not seem to help the situation either
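For reference, the 'balanced' heuristic in scikit-learn weights each class by n_samples / (n_classes * class_count). A plain-Python sketch of that formula, computed on a made-up training label list; the helper name `balanced_class_weights` is hypothetical, and the smoothed variant is not reproduced here because its formula is not given in the text.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Return {class: n_samples / (n_classes * class_count)} for the given labels."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


# Hypothetical training labels: 8 of one class, 2 of another.
y_train = ["Larceny Theft"] * 8 + ["Assault"] * 2
weights = balanced_class_weights(y_train)
print(weights)  # {'Larceny Theft': 0.625, 'Assault': 2.5}
```

A dict like this can be passed as `class_weight` to `RandomForestClassifier`, which is one way to compute the weights from the training set only.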

5.3 We can create an IsWeekDay field to see if this improves the situation

IsWeekDay = 1 (for a weekday) and 0 if it is a weekend
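A minimal sketch of deriving this flag, assuming the day-of-week column uses the pandas convention (Monday = 0 ... Sunday = 6).

```python
import pandas as pd

# Hypothetical rows: Monday, Friday, Saturday, Sunday.
df = pd.DataFrame({"Incident_Date_dayofweek_num": [0, 4, 5, 6]})

# Days 0-4 are Monday to Friday; 5 and 6 are the weekend.
df["IsWeekDay"] = (df["Incident_Date_dayofweek_num"] < 5).astype(int)
```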

However, the score (F1) did not improve upon adding the [IsWeekDay] field, so there is no need to include it.

Now we can consider dropping those classes that have little/no precision. So far, these are the incident categories identified

Incident Category
Case Closure
Courtesy Report
Gambling
Human Trafficking (A), Commercial Sex Acts
Liquor Laws
Stolen Property
Traffic Collision
Traffic Violation Arrest
Vandalism
Vehicle Impounded
Vehicle Misplaced

5.4 Re-run RFC with fewer classes

We choose to drop these classes to see if this improves the results further

Incident Category
Homicide
Suicide
Rape
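Dropping classes amounts to filtering rows by label before the split. A sketch using the three categories just listed; the toy rows and the column name `Incident_Category` are assumptions.

```python
import pandas as pd

# Categories to remove before re-training (from the list above).
DROP_CLASSES = ["Homicide", "Suicide", "Rape"]

# Hypothetical rows.
df = pd.DataFrame({
    "Incident_Category": ["Larceny Theft", "Homicide", "Assault", "Suicide"],
})

reduced = df[~df["Incident_Category"].isin(DROP_CLASSES)].reset_index(drop=True)
print(reduced["Incident_Category"].value_counts())
```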

5.5 Actions Taken So Far to get to Base RFC

004_RFC_Accuracy_progress_BeforeTune.PNG

So far, the dropping of the minor classes showed the highest improvement on the F1 & Accuracy

6. Run a DT as a base model first

7. Here's a basemodel for AdaBoost

Observation : AdaBoost is really affected by the minor classes - except for the major classes, all the other classes are practically 0 precision + 0 recall

8. Let's do a RandomizedSearchCV For These Models

8.1 Let's do a RandomizedSearchCV for AdaBoost

8.2 Let's do a RandomizedSearchCV for RFC

8.3 Let's do a RandomizedSearchCV for DT

This completes the RandomizedSearchCV for DT, RFC and AdaBoost
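As an illustration of the search step, here is a sketch of scikit-learn's `RandomizedSearchCV` on a decision tree. The parameter ranges, the number of iterations, the scoring metric and the synthetic data are all assumptions, not the settings used in this project.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

# Small synthetic stand-in for the real training data.
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=4, n_classes=3, random_state=0
)

# Distributions to sample from (assumed ranges, for illustration only).
param_distributions = {
    "max_depth": randint(2, 10),
    "min_samples_leaf": randint(1, 20),
    "criterion": ["gini", "entropy"],
}

search = RandomizedSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=5,               # number of sampled parameter settings
    scoring="f1_weighted",
    cv=3,
    random_state=0,
)
search.fit(X, y)
best = search.best_params_
print(best, round(search.best_score_, 4))
```

The same pattern applies to RFC and AdaBoost by swapping the estimator and its parameter distributions; `best` can then be plugged into a fresh model.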

8.4 We plug in the optimum parameters for DT, RFC , AdaBoost

8.4.1 Optimized Tuned RFC

8.4.2 Optimized Tuned DT

8.4.3 Optimized Tuned AdaBoost

Here's the relative improvement / decrease ( from base model ) to tuned hyperparameters

005_ComparisionOfBase_Vs_Tuned-2.PNG

Observation:
There is an improvement in Precision, Recall, Accuracy for RFC - the slight drop in F1 is due to the increase in Recall
There seems to be no improvement for ADA
There is a decrease for DT - the increase in F1 is actually attributed to Larceny/Prostitution, but the recall of all the other classes has decreased